Abstract.

The optimization of large portfolios displays an inherent instability to estimation error. 
This poses a fundamental problem, because solutions that are not stable under sample 
fluctuations may look optimal for a given sample, but are, in effect, very far from 
optimal with respect to the average risk. In this paper, we approach the problem 
from the point of view of statistical learning theory. The occurrence of the instability 
is intimately related to over-fitting which can be avoided using known regularization 
methods. We show how regularized portfolio optimization with the expected shortfall 
as a risk measure is related to support vector regression. The budget constraint 
dictates a modification. We present the resulting optimization problem and discuss the 
solution. The L2 norm of the weight vector is used as a regularizer, which corresponds 
to a diversification "pressure". This means that diversification, besides counteracting 
downward fluctuations in some assets by upward fluctuations in others, is also crucial 
because it improves the stability of the solution. The approach we provide here allows 
for the simultaneous treatment of optimization and diversification in one framework 
that enables the investor to trade-off between the two, depending on the size of the 
available data set. 
1. Introduction 

Markowitz' portfolio selection theory [1, 2] is one of the pillars of theoretical finance. It 
has greatly influenced the thinking and practice in investment, capital allocation, index 
tracking, and a number of other fields. Its two major ingredients are (i) seeking a trade- 
off between risk and reward, and (ii) exploiting the cancellation between fluctuations of 
(anti-)correlated assets. In the original formulation of the theory, the underlying process 
was assumed to be multivariate normal. Accordingly, reward was measured in terms of 
the expected return, risk in terms of the variance of the portfolio. 

The fundamental problem of this scheme (shared by all the other variants that have 
been introduced since) is that the characteristics of the underlying process generating 
the distribution of asset prices are not known in practice, and therefore averages are 
replaced by sums over the available sample. This procedure is well justified as long 
as the sample size, T (i.e. the length of the available time series for each item), is 
sufficiently large compared to the size of the portfolio, N (i.e. the number of items). 
In that limit, sample averages converge to the true averages by the law of 
large numbers. 

Unfortunately, the nature of portfolio selection is not compatible with this limit. 
Institutional portfolios are large, with N in the range of hundreds or thousands, while 
considerations of transaction costs and non-stationarity limit the number of available 
data points to a couple of hundred at most. Therefore, portfolio selection works in a 
regime where N and T are, at best, of the same order of magnitude. This, however, is 
not the realm of classical statistical methods. Portfolio optimization is rather closer to 
a situation which, borrowing a term from statistical physics, might be termed the 
"thermodynamic limit", where N and T tend to infinity such that their ratio remains 
fixed. 

It is evident that portfolio theory struggles with the same fundamental difficulty 
that underlies virtually every complex modeling and optimization task: the high 
number of dimensions and the insufficient amount of information available about the 
system. This difficulty has been around in portfolio selection from the early days and 
a plethora of methods have been proposed to cope with it, e.g. single and multi-factor 
models [3], Bayesian estimators [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], or, 
more recently, tools borrowed from random matrix theory [18, 19, 20, 21, 22, 23]. In 
the thermodynamic regime, estimation errors are large, sample to sample fluctuations 
are huge, and results obtained from one sample do not generalize well and can be quite 
misleading concerning the true process. 

The same problem has received considerable attention in the area of machine 
learning. We discuss how the observed instabilities in portfolio optimization (elaborated 
in Section 2) can be understood and remedied by looking at portfolio theory from the 
point of view of machine learning. 

Portfolio optimization is a special case of regression, and therefore can be 
understood as a machine learning problem (see Section 3). In machine learning, as 
well as in portfolio optimization, one wishes to minimize the actual risk, which is the 
risk (or error) evaluated by taking the ensemble average. This quantity, however, 
cannot be computed from the data; only the empirical risk can. The difference between the 
two is not necessarily small in the thermodynamic limit, so that a small empirical risk 
does not automatically guarantee a small actual risk [24]. 

Statistical learning theory [24, 25, 26] finds upper bounds on the generalization 
error that hold with a certain probability. These error bounds quantify the expected 
generalization performance of a model, and they decrease with decreasing capacity of 
the function class that is being fitted to the data. Lowering the capacity therefore lowers 
the error bound and thereby improves generalization. The resulting procedure is often 
referred to as regularization and essentially prevents over-fitting (see Section 4). 

In the thermodynamic limit, portfolio optimization needs to be regularized. We 
show in Section 5 how the above mentioned concepts, which find their practical 
application in support vector machines [27, 28], can be used for portfolio optimization. 
Support vector machines constitute an extremely powerful class of learning algorithms 
which have met with considerable success. We show that regularized portfolio 
optimization, using the expected shortfall as a risk measure, is almost identical to 
support vector regression, apart from the budget constraint. We provide the modified 
optimization problem, which is quadratic in the weights and can be solved by quadratic programming. 

In Section 6, we discuss the financial meaning of the regularizer: minimizing 
the L2 norm of the weight vector corresponds to a diversification pressure. We also 
discuss alternative constraints that could serve as regularizers in the context of portfolio 
optimization. 

Taking this machine learning angle allows one to organize a variety of ideas in the 
existing literature on portfolio optimization filtering methods into one systematic and 
well developed framework. There are basically two choices to be made: (i) which risk 
measure to use, and (ii) which regularizer. These choices result in different methods, 
because different optimization problems are being solved. 

While we focus here on the popular expected shortfall risk measure (in Section 5), 
the variance has a long history as an important risk measure in finance. Several existing 
filtering methods that use the variance risk measure essentially implement regularization, 
without necessarily stating so explicitly. The only work we found [7] that 
mentions regularization in the context of portfolio optimization has not been noticed by 
the ensuing, closely related, literature. It is easy to show that when the L2 norm is used 
as a regularizer, then the resulting method is closely related to Bayesian ridge regression, 
which uses a Gaussian prior on the weights (with the difference of the additional budget 
constraint). The work on covariance shrinkage, such as [8, 9, 10, 11], falls into the same 
category. Other priors can be used [17], which can be expected to lead to different results 
(for an insightful comparison see e.g. [29]). Using the L1 norm has been popularized 
in statistics as the "LASSO" (least absolute shrinkage and selection operator) [29], and 
methods that use any Lp norm are also known as the "bridge" [30]. 
2. Preliminaries: Instability of classical portfolio optimization. 

Portfolio optimization in large institutions operates in what we called the 
thermodynamic limit, where both the number of assets and the number of data points 
are large, with their ratio a certain, typically not very small, number. The estimation 
problem for the mean is so serious [31, 32] as to make the trade-off between risk and 
return largely illusory. Therefore, following a number of authors [8, 9, 33, 34, 35], we 
focus on the minimum variance portfolio and drop the usual constraint on the expected 
return. This is also in line with previous work (see [36] and references therein), and 
makes the treatment simpler without compromising the main conclusions. An extension 
of the results to the more general case is straightforward. 

Nevertheless, even if we forget about the expected return constraint, the problem 
still remains that covariances have to be estimated from finite samples. It is an 
elementary fact from linear algebra that the rank of the empirical N × N covariance 
matrix is at most the smaller of N and T. Therefore, if T < N, the covariance matrix is singular 
and the portfolio selection task becomes meaningless. The point T = N thus separates 
two regions: for T > N the portfolio problem has a solution, whereas for T < N, it 
does not. 
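The rank statement can be checked numerically. The following is a minimal sketch, assuming i.i.d. standard normal returns with a known zero mean (so the sample covariance is formed without centering); the function name `sample_covariance_rank` is purely illustrative.

```python
import numpy as np

def sample_covariance_rank(n_assets, n_obs, seed=0):
    """Rank of the N x N sample covariance built from T observations."""
    rng = np.random.default_rng(seed)
    returns = rng.standard_normal((n_obs, n_assets))  # rows are time points
    cov = returns.T @ returns / n_obs                 # zero mean assumed known
    return int(np.linalg.matrix_rank(cov))

rank_short = sample_covariance_rank(n_assets=10, n_obs=6)   # T < N: singular
rank_long = sample_covariance_rank(n_assets=10, n_obs=50)   # T > N: full rank
print(rank_short, rank_long)
```

If the mean were estimated from the same sample and subtracted, one further degree of freedom would be lost and the rank would be at most T - 1.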

Even if T is larger than N, but not much larger, the solution to the minimum 
variance problem is unstable under sample fluctuations, which means that it is not 
possible to find the optimal portfolio in this way. This instability of the estimated 
covariances, and hence of the optimal solutions, has been generally known in the 
community. However, the full depth of the problem has only been recognized recently, 
when it was pointed out that the average estimation error diverges at the critical point 
N = T [37, 38, 39]. 

In order to characterize the estimation error, Kondor and co-workers used the 
ratio $q_0$ between (i) the risk, evaluated at the optimal solution obtained by portfolio 
optimization using finite data and (ii) the true minimal risk. This quantity is a measure 
of generalization performance, with perfect performance when $q_0 = 1$, and increasingly 
bad performance as $q_0$ increases. As found numerically in [38] and demonstrated 
analytically by random matrix theory techniques in [40], the quantity $q_0$ is proportional 
to $(1 - N/T)^{-1/2}$ and diverges when T goes to N from above. 
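The divergence can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not taken from the cited works: returns are i.i.d. standard normal (true covariance equal to the identity), the mean is known to be zero, and risk is measured by the standard deviation. Under these assumptions the true minimum variance portfolio has equal weights 1/N, so the ratio reduces to sqrt(N) times the length of the estimated weight vector. The helper names `q0_ratio` and `mean_q0` are illustrative.

```python
import numpy as np

def q0_ratio(n_assets, n_obs, rng):
    """True risk of the sample-optimal portfolio divided by the true minimal risk."""
    x = rng.standard_normal((n_obs, n_assets))   # true covariance = identity
    cov = x.T @ x / n_obs                        # sample covariance (zero mean known)
    w = np.linalg.solve(cov, np.ones(n_assets))  # unnormalized minimum variance weights
    w /= w.sum()                                 # enforce the budget constraint
    # True variance of w is ||w||^2; the true minimum, attained at w_i = 1/N, is 1/N.
    return np.sqrt(n_assets * (w @ w))

def mean_q0(n_assets, n_obs, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    return float(np.mean([q0_ratio(n_assets, n_obs, rng) for _ in range(n_samples)]))

N = 40
results = {ratio: mean_q0(N, int(round(N / ratio))) for ratio in (0.1, 0.5, 0.8)}
for ratio, q0 in results.items():
    print(f"N/T = {ratio:.1f}: simulated q0 = {q0:.3f}, "
          f"(1 - N/T)^(-1/2) = {(1 - ratio) ** -0.5:.3f}")
```

The simulated values grow as N/T approaches one, in line with the scaling quoted above; finite-size effects mean the agreement is only approximate.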

The identification of the point N = T as a phase transition [36, 41] allowed for 
the establishment of a link between portfolio optimization and the theory of phase 
transitions, which helped to organize a number of seemingly disparate phenomena into 
a single coherent picture with a rich conceptual content. For example, it has been shown 
that the divergence is not a special feature of the variance, but persists under all the 
other alternative risk measures that have been investigated so far: historical expected 
shortfall, maximal loss, mean absolute deviation, parametric VaR, expected shortfall, 
and semivariance [36, 41, 42, 43]. The critical value of the N/T ratio, at which the 
divergence occurs, depends on the particular risk measure and on any parameter that 
the risk measure may depend on (such as the confidence level in expected shortfall). 
However, as a manifestation of universality, the power law governing the divergence 
of the estimation error is independent of the risk measure [36, 41, 42], the covariance 
structure of the market [39], and the statistical nature of the underlying process [44]. 
Ultimately, this line of thought led to the discovery of the instability of coherent risk 
measures [45]. 

3. Statistical reasons for the observed instability in portfolio optimization 

As mentioned above, for simplicity and clarity of the treatment we do not impose a 
constraint on the expected return, and only look for the global minimum risk portfolio. 
This task can be formalized as follows: Given a fixed budget, customarily taken to 
be unity, given T past measurements of the returns of N assets, $x_i^{(k)}$, $i = 1, \ldots, N$, 
$k = 1, \ldots, T$, and given the risk functional $F(\mathbf{w} \cdot \mathbf{x})$, find a weighted sum (the portfolio), 
$\mathbf{w} \cdot \mathbf{x}$,‡ such that it minimizes the actual risk 

$$R(\mathbf{w}) = \int d\mathbf{x}\, p(\mathbf{x})\, F(\mathbf{w} \cdot \mathbf{x}) \qquad (1)$$

under the constraint that $\sum_i w_i = 1$. The central problem is that one does not know the 
distribution $p(\mathbf{x})$, which is assumed to underlie the generation of the data. In practice, 
one then minimizes the empirical risk, replacing ensemble averages by sample averages: 

$$R_{\rm emp}(\mathbf{w}) = \frac{1}{T} \sum_{k=1}^{T} F(\mathbf{w} \cdot \mathbf{x}^{(k)}) \qquad (2)$$

Now, let us interpret the weight vector as a linear model. The model class given by the 
linear functions has a capacity h, a concept that was introduced by Vapnik 
and Chervonenkis in order to measure how powerful a learning machine is [24, 25, 26]. (In 
the statistical learning literature, a learning machine is thought of as having a function 
class at its disposal, together with an induction principle and an algorithmic procedure 
for the implementation thereof [46].) The capacity measures how powerful a function 
class is, and thereby also how easy it is to learn a model of that class. The rough idea is 
this: a learning machine has larger capacity if it can potentially fit more different types 
of data sets. Higher capacity comes, however, at the cost of potentially over-fitting 
the data. Capacity can be measured, for example, by the Vapnik-Chervonenkis (VC-) 
dimension [21], which is a combinatorial measure that counts how many data points can 
be separated in all possible ways by any function of a given class. 

‡ Notation: bold face symbols are understood to denote vectors. 

To make the idea tangible for linear models, focus on two dimensions (N = 2). For 
each number of points, n, one can choose the geometrical arrangement of the points in 
the plane freely. Once it is chosen, points are labeled by one of two labels, say "red" 
and "blue". Can a line separate the red points from the blue points for any of the $2^n$ 
different ways in which the points could be colored? The VC-dimension is the largest 
number of points for which this can be done. Two points can trivially be separated 
by a line. Three points that are not collinear can still be separated for any of 
the 8 possible labelings. However, for four points this is no longer the case: for every 
geometrical arrangement of four points there is a labeling that cannot be 
separated by a line. The VC-dimension is therefore 3, and in general, for linear models in N 
dimensions, it is N + 1 [46, 47]. 
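The counting argument for N = 2 can be checked by brute force. The sketch below is an illustration only: it tests strict linear separability of each labeling by solving a feasibility linear program with `scipy.optimize.linprog`, and it examines one particular three-point and one particular four-point arrangement (the corners of a unit square). It therefore demonstrates, rather than proves, the statement about four points. The helper names `separable` and `shattered` are assumptions made for this example.

```python
from itertools import product

import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True if some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all points."""
    pts = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Feasibility LP in the variables (w_1, w_2, b): -y_i * (w . x_i + b) <= -1.
    a_ub = -y[:, None] * np.hstack([pts, np.ones((len(pts), 1))])
    b_ub = -np.ones(len(pts))
    res = linprog(c=np.zeros(3), A_ub=a_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0  # status 0: a feasible (optimal) point was found

def shattered(points):
    """True if every +/-1 labeling of the points is linearly separable."""
    return all(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

triangle = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(shattered(triangle), shattered(square))
```

For the square, the labeling that assigns one color to each diagonal is the one that fails.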

In the regime in which the number of data points is much larger than the capacity 
of the learning machine, h/T ≪ 1, a small empirical risk guarantees small actual risk 
[21]. For linear functions through the origin that are otherwise unconstrained, the VC-dimension 
grows with N. In the thermodynamic regime, where N/T is not very small, 
minimizing the empirical risk does not necessarily guarantee a small actual risk [24]. 
Therefore it is not guaranteed to produce a solution that generalizes well to other data 
drawn from the same underlying distribution. 

In solving the optimization problem that minimizes the empirical risk, Eq. (2), in the 
regime in which N/T is not very small, portfolio optimization over-fits the observed data. 
It thereby finds a solution that essentially pays attention to the seeming correlations 
in the data which come from estimation noise due to finite sample effects, rather than 
from real structure. The solution is thus different for different realizations of the data, 
and does not necessarily come close to the actual optimal portfolio. 
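The gap between empirical and actual risk can be made concrete for the variance risk measure. This is a sketch under assumptions chosen for simplicity, not taken from the text: i.i.d. standard normal returns with identity covariance and known zero mean, N = 50 and T = 100. The empirical risk is the variance estimated on the same sample that was used to fit the weights; the actual risk is the variance of those weights under the true covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, n_samples = 50, 100, 200
ones = np.ones(N)
empirical, actual = [], []
for _ in range(n_samples):
    x = rng.standard_normal((T, N))     # true covariance is the identity
    cov = x.T @ x / T                   # sample covariance
    w = np.linalg.solve(cov, ones)
    w /= w.sum()                        # minimum variance weights for this sample
    empirical.append(w @ cov @ w)       # risk as seen on the fitting sample
    actual.append(w @ w)                # risk under the true covariance

mean_empirical = float(np.mean(empirical))
mean_actual = float(np.mean(actual))
true_minimum = 1.0 / N                  # variance of the equal-weight portfolio
print(mean_empirical, true_minimum, mean_actual)
```

In this setting the empirical risk falls below the true minimum while the actual risk lies above it, which is the signature of over-fitting described above.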

4. Overcoming the instability 

The generalization error can be bounded from above (with a certain probability) by 
the empirical error plus a confidence term that is monotonically increasing with some 
measure of the capacity, and depends on the probability with which the bound holds 
[48]. Several different bounds have been established, connected with different measures 
of capacity, see e.g. [47]. 

Poor generalization and over-fitting can be improved upon by decreasing the 
capacity of the model [25, 26], which helps to lower the generalization error. Support 
vector machines are a powerful class of algorithms that implement this idea. 

We suggest that if one wants to find a solution to the portfolio optimization problem 
in the thermodynamic regime, then one should not minimize the empirical risk alone, 
but also constrain the capacity of the portfolio optimizer (the linear model). 

How can portfolio optimization be regularized? Portfolio optimization is essentially 
a regression problem, and therefore we can apply statistical learning theory, in particular 
the work on support vector regression. 

Note first that the capacity of a linear model class for which the length of the 
weight vector is restricted to $\|\mathbf{w}\|^2 < A$ has an upper bound which is smaller than 
the capacity of unconstrained linear models [25, 26]. The capacity is minimized when 
the length of the weight vector is minimized [25, 26]. Vapnik's concept of structural 
risk minimization [24] results in the support vector algorithm [27, 28] which finds the 
model with the smallest capacity that is consistent with the data, that is the model 
with smallest $\|\mathbf{w}\|^2$. This leads to a convex constrained optimization problem [27, 28] 
which can be solved using quadratic programming. 
5. Regularized portfolio optimization with the expected shortfall risk 
measure. 

While the original Markowitz formulation [1] measures risk by the variance, many other 
risk measures have been proposed since. Today, the most widely used risk measure, both 
in practice and in regulation, is Value at Risk (VaR) [49, 50]. VaR has, however, been 
criticized for its lack of convexity, see e.g. [51, 52, 53], and an axiomatic approach, 
leading to the introduction of the class of coherent risk measures, was put forward [51]. 
Expected shortfall, essentially a conditional average measuring the average loss above a 
high threshold, has been demonstrated to belong to this class [54, 55, 56]. 

Expected shortfall has been steadily gaining popularity in recent years. The 
regularization we propose here is intended to cure its weak point, the sensitivity to 
sample fluctuations, at least for reasonable values of the ratio N/T. 

Choose the risk functional $F(z) = z\,\theta(z - \alpha_\beta)$, where $\alpha_\beta$ is a threshold, such that 
a given fraction $\beta$ of the (empirical) loss-distribution over z lies above $\alpha_\beta$. One now 
wishes to minimize the average over the remaining tail distribution, containing the 
fraction $\nu := 1 - \beta$, and defines the expected shortfall as 
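
$$\mathrm{ES} = \min_{\epsilon} \left[ \epsilon + \frac{1}{\nu T} \sum_{k=1}^{T} \left( -\epsilon - \mathbf{w} \cdot \mathbf{x}^{(k)} \right) \theta\!\left( -\epsilon - \mathbf{w} \cdot \mathbf{x}^{(k)} \right) \right]. \qquad (3)$$

The term in the sum implements the $\theta$-function, while $\nu$ in the denominator ensures 
normalization of the tail distribution. It has been pointed out [57] that this optimization 
problem maps onto solving the linear program: 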

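$$\min_{\mathbf{w}, \boldsymbol{\xi}, \epsilon} \left[ \frac{1}{T} \sum_{k=1}^{T} \xi_k + \nu \epsilon \right] \qquad (4)$$

$$\text{s.t.} \quad \mathbf{w} \cdot \mathbf{x}^{(k)} + \epsilon + \xi_k \geq 0; \quad \xi_k \geq 0; \qquad (5)$$

$$\sum_i w_i = 1. \qquad (6)$$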
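A linear program of this type can be handed to a generic solver. The following is a minimal sketch, assuming the program has the form: minimize (1/T) Σ_k ξ_k + ν ε subject to w·x^(k) + ε + ξ_k ≥ 0, ξ_k ≥ 0 and Σ_i w_i = 1. It uses `scipy.optimize.linprog` on synthetic Gaussian returns; the function name `min_expected_shortfall` and all parameter values are illustrative choices. Dividing the optimal objective by ν gives the expected shortfall of the resulting portfolio.

```python
import numpy as np
from scipy.optimize import linprog

def min_expected_shortfall(returns, nu):
    """Solve the expected shortfall LP; `returns` is a (T, N) array."""
    T, N = returns.shape
    # Variable vector: [w_1..w_N, xi_1..xi_T, eps].
    c = np.concatenate([np.zeros(N), np.full(T, 1.0 / T), [nu]])
    # w . x^(k) + eps + xi_k >= 0, rewritten as -w . x^(k) - xi_k - eps <= 0.
    a_ub = np.hstack([-returns, -np.eye(T), -np.ones((T, 1))])
    b_ub = np.zeros(T)
    a_eq = np.concatenate([np.ones(N), np.zeros(T + 1)])[None, :]  # budget
    bounds = [(None, None)] * N + [(0, None)] * T + [(None, None)]
    res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    if res.status != 0:
        # For N/T too large the LP can be unbounded: the instability discussed above.
        raise RuntimeError(res.message)
    return res.x[:N], res.x[-1], res.fun / nu

rng = np.random.default_rng(0)
sample = rng.standard_normal((200, 3))       # T = 200 observations, N = 3 assets
weights, threshold, es = min_expected_shortfall(sample, nu=0.1)
print(weights, threshold, es)
```

With ν T an integer (here 20), the reported expected shortfall equals the average of the ν T largest losses of the optimal portfolio on the sample.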



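We propose to implement regularization by including the minimization of $\|\mathbf{w}\|^2$. This 
can be done using a Lagrange multiplier, C, to control the trade-off: as we relax the 
constraint on the length of the weight vector, we can, of course, make the empirical 
error go to zero and retrieve the solution to the minimal expected shortfall problem. 
The new optimization problem reads: 

$$\min_{\mathbf{w}, \boldsymbol{\xi}, \epsilon} \left[ \frac{1}{2} \|\mathbf{w}\|^2 + C \left( \frac{1}{T} \sum_{k=1}^{T} \xi_k + \nu \epsilon \right) \right] \qquad (7)$$

$$\text{s.t.} \quad -\mathbf{w} \cdot \mathbf{x}^{(k)} \leq \epsilon + \xi_k; \qquad (8)$$

$$\xi_k \geq 0; \quad \epsilon \geq 0; \qquad (9)$$

$$\sum_i w_i = 1. \qquad (10)$$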
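To see the effect of the trade-off parameter C, the regularized problem can be solved numerically on a small example. The sketch below assumes the objective (1/2)‖w‖² + C((1/T) Σ_k ξ_k + ν ε) with constraints -w·x^(k) ≤ ε + ξ_k, ξ_k ≥ 0, ε ≥ 0 and Σ_i w_i = 1. It uses the general-purpose SLSQP routine of `scipy.optimize.minimize` instead of a dedicated quadratic programming solver, which is adequate only for small illustrative problems. The function name `regularized_es_weights` and the synthetic data are assumptions of this example.

```python
import numpy as np
from scipy.optimize import minimize

def regularized_es_weights(returns, nu, C):
    """Minimize 0.5*||w||^2 + C*(mean(xi) + nu*eps) under budget and slack constraints."""
    T, N = returns.shape

    def objective(z):
        w, xi, eps = z[:N], z[N:N + T], z[-1]
        return 0.5 * w @ w + C * (xi.mean() + nu * eps)

    constraints = [
        {"type": "eq", "fun": lambda z: z[:N].sum() - 1.0},  # budget constraint
        # w . x^(k) + eps + xi_k >= 0 for every observation k
        {"type": "ineq", "fun": lambda z: returns @ z[:N] + z[-1] + z[N:N + T]},
    ]
    bounds = [(None, None)] * N + [(0.0, None)] * (T + 1)    # xi_k >= 0, eps >= 0
    w0 = np.full(N, 1.0 / N)
    z0 = np.concatenate([w0, np.maximum(0.0, -returns @ w0), [0.0]])  # feasible start
    res = minimize(objective, z0, method="SLSQP", bounds=bounds,
                   constraints=constraints,
                   options={"maxiter": 500, "ftol": 1e-10})
    return res.x[:N]

rng = np.random.default_rng(0)
sample = rng.standard_normal((60, 4))                        # T = 60, N = 4
w_strong = regularized_es_weights(sample, nu=0.2, C=1e-3)    # regularizer dominates
w_weak = regularized_es_weights(sample, nu=0.2, C=100.0)     # data term dominates
print(w_strong, w_weak)
```

For small C the weights stay close to the equal-weight portfolio, which has the shortest weight vector compatible with the budget; for large C they follow the sample more closely.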
The problem is mathematically almost identical to a support vector regression (SVR) 
algorithm called $\nu$-SVR. There are two differences: (i) the budget constraint is added, 
and (ii) the loss function is asymmetric. Expected shortfall is an asymmetric version 
of the $\epsilon$-insensitive loss, used in support vector regression, defined as 
$\max\{0, |f(\mathbf{x}) - y| - \epsilon\}$, where $f(\mathbf{x})$ is the interpolant, and y the measured value (response). 
In that sense $\epsilon$ measures an allowable error below which deviations are discarded.§ 
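The two loss functions can be written down directly. This is a small sketch; the function names `eps_insensitive` and `one_sided` are illustrative, and the one-sided variant is meant only to show the kind of asymmetry described here (deviations of one sign are never penalized), not to reproduce the exact normalization of expected shortfall.

```python
import numpy as np

def eps_insensitive(residual, eps):
    """Symmetric loss max(0, |residual| - eps) used in support vector regression."""
    return np.maximum(0.0, np.abs(residual) - eps)

def one_sided(shortfall, eps):
    """Asymmetric variant: only shortfalls exceeding eps are penalized."""
    return np.maximum(0.0, shortfall - eps)

residuals = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
sym = eps_insensitive(residuals, eps=1.0)   # penalizes both tails
asym = one_sided(residuals, eps=1.0)        # penalizes only the upper tail
print(sym, asym)
```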

The use of asymmetric risk measures in finance is motivated by the consideration 
that investors are not afraid of upside fluctuations. However, to make the relationship 
to support vector regression as clear as possible, we will first solve the more general 
symmetrized problem, before restricting our treatment to the completely asymmetric 
case, corresponding to expected shortfall. In addition, one may argue that focusing 
exclusively on large negative fluctuations might not be advisable even from a financial 
point of view, especially when one does not have sufficiently large samples. In a relatively 
small sample it may happen that a particular item, or a certain combination of items, 
dominates the rest, i.e. produces a larger return than any other item in the portfolio 
at each time point, even though no such dominance exists on longer time scales. The 
probability of such an apparent arbitrage increases with the ratio N/T, and when it 
occurs it may encourage an investor acting on a lopsided risk measure to take up very 
large long positions in the dominating item(s), which may turn out to be detrimental 
in the long run. This is the essence of the argument that has led to the discovery of 
the instability of coherent and downside risk measures [43, 45]. 

According to the above, let us consider the general case where positive deviations 
are also penalized. The objective function, Eq. (7), then becomes 

$$\min_{\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}^*, \epsilon} \left[ \frac{1}{2} \|\mathbf{w}\|^2 + C \left( \frac{1}{T} \sum_{k=1}^{T} (\xi_k + \xi_k^*) + \nu \epsilon \right) \right] \qquad (11)$$

with the additional constraints 

$$\mathbf{w} \cdot \mathbf{x}^{(k)} \leq \epsilon + \xi_k^*; \quad \xi_k^* \geq 0. \qquad (12)$$

This problem corresponds to $\nu$-SVR, a well understood regression method [60], with 
the only difference that the budget constraint, Eq. (10), is added here. In the finance 
context the associated loss might be called symmetric tail average (STA). Solving the 
regularized expected shortfall minimization problem, Eqs. (7)-(10), is a special case of 
solving the regularized STA minimization problem, Eq. (11) with the constraints Eqs. 
(8)-(10) and (12). Therefore, we solve the more general problem first (Section 5.1), 
before providing, in Section 5.2, the solution to the regularized expected shortfall, Eqs. 
(7)-(10). 



§ The mathematical similarity between minimum expected shortfall without regularization and the $E\nu$-SVM 
algorithm [58] was pointed out, but incorrectly, in [59]. There is an important difference between 
the two optimization problems. In $E\nu$-SVM, the length of the weight vector, $\|\mathbf{w}\|$, is constrained, which 
implements capacity control. In the pure expected shortfall minimization, Eq. (4), this is not done. 
Instead, the total budget $\sum_i w_i$ is fixed. This difference is not correctly identified in the proof of the 
central theorem (Theorem 1) in [59]. 
5.1. Regularized Symmetric Tail Average Minimization 



The solution to the regularized symmetric tail average problem, Eq. (11) with 
the constraints Eqs. (8)-(10) and (12), is found in analogy to support vector 
regression, following [60], by writing down the Lagrangian, using Lagrange multipliers, 
$\{\alpha, \alpha^*, \gamma, \lambda, \eta, \eta^*\}$, for the constraints. The solution is then a saddle point, i.e. a minimum 
over primal and a maximum over dual variables. The Lagrangian is different from the 
one that arises in $\nu$-SVR in that it is modified by the budget constraint: 
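
$$L = \frac{1}{2} \|\mathbf{w}\|^2 + \frac{C}{T} \sum_{k=1}^{T} (\xi_k + \xi_k^*) + C \nu \epsilon - \lambda \epsilon - \sum_{k=1}^{T} (\eta_k \xi_k + \eta_k^* \xi_k^*) + \gamma \left( \mathbf{w} \cdot \mathbf{1} - 1 \right)$$
$$\qquad - \sum_{k=1}^{T} \alpha_k \left( \epsilon + \xi_k + \mathbf{w} \cdot \mathbf{x}^{(k)} \right) - \sum_{k=1}^{T} \alpha_k^* \left( \epsilon + \xi_k^* - \mathbf{w} \cdot \mathbf{x}^{(k)} \right),$$

where $\mathbf{1}$ denotes the vector of length N with all entries equal to one. Setting the derivative of the Lagrangian 
w.r.t. $\mathbf{w}$ to zero gives: 

$$\mathbf{w}_{\rm opt} = \sum_{k=1}^{T} (\alpha_k - \alpha_k^*)\, \mathbf{x}^{(k)} - \gamma \mathbf{1}. \qquad (16)$$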



This solution for the optimal portfolio is sparse in the sense that, due to the Karush-Kuhn-Tucker 
conditions (see e.g. [61]), only those points contribute to the optimal 
portfolio weights for which the inequality constraints in Eq. (8), and the corresponding 
constraints in Eq. (12), are met exactly. The solution $\mathbf{w}_{\rm opt}$ contains only those 
points, and effectively ignores the rest. This sparsity contributes to the stability of 
the solution. Regularized portfolio optimization (RPO) operates, in contrast to general 
regression, with a fixed budget. As a consequence, the Lagrange multiplier $\gamma$ now 
appears in the optimal solution, Eq. (16). Compared to the optimal solution in support 
vector (SV) regression, $\mathbf{w}_{\rm SV}$, the solution vector under the budget constraint, $\mathbf{w}_{\rm RPO}$, is 
shifted by $\gamma$: 

$$\mathbf{w}_{\rm RPO} = \mathbf{w}_{\rm SV} - \gamma \mathbf{1}. \qquad (17)$$

Let us now consider the dual problem. The dual is, in general, a function 
of the dual variables, which are here $\{\alpha, \alpha^*, \gamma, \lambda, \eta, \eta^*\}$, although we will see in 
the following that some of these variables drop out. The dual is defined as 
$D := \min_{\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}^*, \epsilon} L[\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}^*, \epsilon, \alpha, \alpha^*, \gamma, \lambda, \eta, \eta^*]$, and the dual problem is then to maximize 
D over the dual variables. We can replace the minimization over $\mathbf{w}$ by evaluating the 
Lagrangian at $\mathbf{w}_{\rm opt}$. For that we have to evaluate 



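$$F[\mathbf{w}_{\rm opt}] = \frac{1}{2} \|\mathbf{w}_{\rm opt}\|^2 - \mathbf{w}_{\rm opt} \cdot \left( \sum_{k=1}^{T} (\alpha_k - \alpha_k^*)\, \mathbf{x}^{(k)} - \gamma \mathbf{1} \right) - \gamma = -\frac{1}{2} \|\mathbf{w}_{\rm opt}\|^2 - \gamma.$$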



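For the other terms in the Lagrangian, we have to consider different cases: 

(i) If $\left( C\nu - \lambda - \sum_{k=1}^{T} (\alpha_k + \alpha_k^*) \right) < 0$, then L can be made arbitrarily negative by 
letting $\epsilon \to \infty$, so that $D = -\infty$. 

(ii) If $\left( C\nu - \lambda - \sum_{k=1}^{T} (\alpha_k + \alpha_k^*) \right) \geq 0$: The term $\epsilon \left( C\nu - \lambda - \sum_{k=1}^{T} (\alpha_k + \alpha_k^*) \right)$ 
vanishes. Reason: if equality holds, this is trivially true, and if the inequality 
holds strictly then L can be minimized by setting $\epsilon = 0$. 

Similarly, for the other constraints (the notation (*) means that this is true for variables 
with and without the asterisk): 

(iii) If $\left( C/T - \alpha_k^{(*)} - \eta_k^{(*)} \right) < 0$, then L can be made arbitrarily negative by 
letting $\xi_k^{(*)} \to \infty$, so that $D = -\infty$. 

(iv) If $\left( C/T - \alpha_k^{(*)} - \eta_k^{(*)} \right) \geq 0$, then $\xi_k^{(*)} \left( C/T - \alpha_k^{(*)} - \eta_k^{(*)} \right) = 0$. If equality holds then it is 
trivially true. If the inequality holds strictly then L can be minimized by setting $\xi_k^{(*)} = 0$. 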

By a similar argument, the term 7 in Eq. (lTlj) disappears in the Dual. Altogether we 
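have that either $D = -\infty$, or 

$$D(\alpha, \alpha^*, \gamma) = \min_{\mathbf{w}} F[\mathbf{w}] = F[\mathbf{w}_{\rm opt}(\alpha, \alpha^*, \gamma)].$$

Note that the variables $\xi^{(*)}, \eta^{(*)}, \epsilon, \lambda$ do not appear in $F[\mathbf{w}_{\rm opt}(\alpha, \alpha^*, \gamma)]$. The dual 
problem is therefore given by 

$$\max_{\alpha, \alpha^*, \gamma} D(\alpha, \alpha^*, \gamma) \quad \text{s.t.} \quad \alpha_k^{(*)} \in \left[ 0, \frac{C}{T} \right]; \quad \sum_{k=1}^{T} (\alpha_k + \alpha_k^*) \leq C\nu.$$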
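We can analytically maximize over $\gamma$ and obtain for the optimal value 

$$\gamma = \frac{1}{N} \left( \sum_{k=1}^{T} (\alpha_k - \alpha_k^*) \sum_{i=1}^{N} x_i^{(k)} - 1 \right). \qquad (26)$$

The optimal projection (= optimal portfolio) is given by 

$$\mathbf{w}_{\rm opt} \cdot \mathbf{x} = \sum_{k=1}^{T} (\alpha_k - \alpha_k^*)\, \mathbf{x}^{(k)} \cdot \mathbf{x} - \frac{1}{N} \left( \sum_{k=1}^{T} (\alpha_k - \alpha_k^*) \sum_{i=1}^{N} x_i^{(k)} - 1 \right) \mathbf{1} \cdot \mathbf{x}. \qquad (27)$$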
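The role of the shift γ can be verified numerically. The sketch below assumes the expansion w_opt = Σ_k (α_k - α_k*) x^(k) - γ1 with γ = (1/N)(Σ_k (α_k - α_k*) Σ_i x_i^(k) - 1). The coefficients `beta`, standing in for α_k - α_k*, are drawn at random rather than obtained from an actual dual solution, since the point is only that this choice of γ enforces the budget constraint for any coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 30, 8
x = rng.standard_normal((T, N))       # x[k] plays the role of x^(k)
beta = rng.uniform(-1.0, 1.0, T)      # arbitrary stand-in for alpha_k - alpha_k^*

w_sv = beta @ x                       # support vector expansion sum_k beta_k x^(k)
gamma = (w_sv.sum() - 1.0) / N        # shift implied by the budget constraint
w_rpo = w_sv - gamma * np.ones(N)     # budget-constrained weights

# Projection of a new return vector, written in terms of dot products plus the shift.
x_new = rng.standard_normal(N)
projection = beta @ (x @ x_new) - gamma * x_new.sum()
print(w_rpo.sum(), projection)
```

The second term of the projection involves the sum of the components of the new return vector rather than a dot product with a sample point, which is why the expansion is no longer purely in terms of dot products of input vectors.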

For N → ∞ the second term vanishes and the solution is the same as the solution 
in support vector regression. Note that the kernel-trick (see e.g. [37]), which is used 
in support vector machines to find nonlinear models, hinges on the fact that only dot 
products of input vectors appear in the support vector expansion of the solution. As a 
consequence of the budget constraint, one can no longer use the kernel-trick (compare 
Eq. (27)). As long as we disregard derivatives, this is not a problem for portfolio 
optimization. Keep in mind, however, that the budget constraint introduces this 
otherwise undesirable property. 

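Support vector algorithms typically solve the dual form of the problem (for a recent 
survey see [62]), which is in our case given by 

$$\max_{\alpha, \alpha^*} \left[ -\frac{1}{2} \sum_{k=1}^{T} \sum_{l=1}^{T} (\alpha_k - \alpha_k^*)(\alpha_l - \alpha_l^*) \left( \mathbf{x}^{(k)} \cdot \mathbf{x}^{(l)} - \frac{1}{N} \sum_{i=1}^{N} x_i^{(k)} \sum_{j=1}^{N} x_j^{(l)} \right) - \frac{1}{N} \sum_{k=1}^{T} (\alpha_k - \alpha_k^*) \sum_{i=1}^{N} x_i^{(k)} \right]$$
$$\text{s.t.} \quad \alpha_k^{(*)} \in \left[ 0, \frac{C}{T} \right]; \quad \sum_{k=1}^{T} (\alpha_k + \alpha_k^*) \leq C\nu. \qquad (28)$$

For N → ∞ the problem becomes identical to $\nu$-SVR, which can be solved by quadratic 
programming, for which software packages are available [63]. For finite N, it can still 
be solved with existing methods, because it is quadratic in the $\alpha_k$'s. Solvers such as 
the ones discussed in [64] can be used, but have to be adapted to this specific 
problem. 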

The regularized symmetric tail average minimization problem (Eq. (11) with the 
constraints Eqs. (8)-(10) and (12)) is, as we have shown here, directly related to support 
vector regression which uses the $\epsilon$-insensitive loss function. The $\epsilon$-insensitive loss is stable 
to local changes for data points that fall outside the range specified by $\epsilon$. This point 
is elaborated in Section 3 in [60], and relates this method to robust estimation of the 
mean. It can also be extended to robust estimation of quantiles [60] by scaling of the 
slack variables by $\mu$ and by $1 - \mu$, respectively. 

This scaling translates directly to the portfolio optimization problem, which is an 
extreme case: downside risk measures penalize only loss, not gain. The asymmetry in 
the loss function corresponds to $\mu = 1$. 
5.2. Regularized expected shortfall. 

By this final change we arrive at the regularized portfolio optimization problem, Eqs. 
(7)-(10), which we originally set out to solve. This is now easily solved in analogy to 
the previous paragraphs: the slack variables disappear, together with the respective 
Lagrange multipliers which enforce constraints, including $\alpha_k^*$. The optimal solution is 
now 

$$\mathbf{w}_{\rm opt} = \sum_{k=1}^{T} \alpha_k\, \mathbf{x}^{(k)} - \gamma \mathbf{1}, \qquad (29)$$

with 

$$\gamma = \frac{1}{N} \left( \sum_{k=1}^{T} \alpha_k \sum_{i=1}^{N} x_i^{(k)} - 1 \right). \qquad (30)$$

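The dual problem is given by 

$$\max_{\alpha} \left[ -\frac{1}{2} \sum_{k=1}^{T} \sum_{l=1}^{T} \alpha_k \alpha_l \left( \mathbf{x}^{(k)} \cdot \mathbf{x}^{(l)} - \frac{1}{N} \sum_{i=1}^{N} x_i^{(k)} \sum_{j=1}^{N} x_j^{(l)} \right) - \frac{1}{N} \sum_{k=1}^{T} \alpha_k \sum_{i=1}^{N} x_i^{(k)} \right]$$
$$\text{s.t.} \quad \alpha_k \in \left[ 0, \frac{C}{T} \right]; \quad \sum_{k=1}^{T} \alpha_k \leq C\nu \qquad (31)$$

which, like its symmetric counterpart, Eq. (28), can be solved by adjusting existing 
algorithms. 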

The formalism provides a free parameter, C, to set the balance between the original 
risk function and the regularizer. Its choice may depend on a number of factors, such 
as the investor's time horizon, the nature of the underlying data, and, crucially, on the 
ratio N/T. Intuitively, there must be a maximum allowable value $C_{\max}(N/T)$ for C, 
such that when one puts more emphasis on the data, $C > C_{\max}(N/T)$, then over-fitting 
will occur with high probability. It would be desirable to know an analytic expression 
for (a bound on) $C_{\max}(N/T)$. In practice, cross-validation methods are often employed 
in machine learning to set the value of C. Those methods are not free of problems (see, 
for example, the treatment in [65]), and the optimal choice of this parameter remains 
an open problem. 

6. Regularization corresponds to portfolio diversification. 

Above, we have controlled the capacity of the linear model by minimizing the L2 norm 
of the portfolio weight vector. In the finance context, minimizing 

$$\|\mathbf{w}\|^2 = \sum_{i=1}^{N} w_i^2 \qquad (32)$$

corresponds roughly to maximizing the effective number of assets, $N_{\rm eff}$, i.e. to exerting 
a pressure towards portfolio diversification [66]. We conclude that diversification of the 
portfolio is crucial, because it serves to counteract the observed instability by acting as 
a regularizer. 
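A common way to quantify the effective number of assets is the inverse of the sum of squared weights. This particular definition is an assumption made for the illustration below, since the text does not spell one out; under it, minimizing the squared L2 norm at fixed budget is the same as maximizing the effective number of assets.

```python
import numpy as np

def effective_number_of_assets(weights):
    """Inverse sum of squared weights (assumed definition, weights summing to one)."""
    w = np.asarray(weights, dtype=float)
    return 1.0 / np.sum(w ** 2)

equal = np.full(10, 0.1)                       # fully diversified over 10 assets
concentrated = np.array([1.0] + [0.0] * 9)     # everything in a single asset
tilted = np.array([0.5] + [0.5 / 9] * 9)       # half the budget in one asset

n_equal = effective_number_of_assets(equal)
n_conc = effective_number_of_assets(concentrated)
n_tilted = effective_number_of_assets(tilted)
print(n_equal, n_conc, n_tilted)
```

The measure ranges from one, for a portfolio held in a single asset, to N for equal weights.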
Other constraints that penalize the length of the weight vector could alternatively 
be considered as a regularizer, in particular any Lp norm. The budget constraint alone, 
however, does not suffice as a regularizer, since it does not constrain the length of 
the weight vector. Adding a ban on short selling, $w_i \geq 0$, to the budget constraint, 
$\sum_i w_i = 1$, limits the allowable solutions to a finite volume in the space of weights and 
is equivalent to requiring that $\sum_i |w_i| \leq 1$.‖ It thereby imposes a limit on the L1 norm, 
that is on the sum of the absolute amplitudes of long and short positions. 

One may argue that it may be a good idea to use the L1 norm instead of the 
L2 norm, because that may make the solution sparser. However, the L1 norm has a 
tendency to make some of the weights vanish. Indeed, it has been shown that in the 
orthonormal design case (using the variance as the risk measure) an L1 regularizer will 
set some of the weights to zero, while an L2 regularizer will scale all the weights [29]. 
The spontaneous reduction of portfolio size has also been demonstrated in numerical 
simulations [67]: as one goes deeper and deeper into the regime where T is significantly 
smaller than N, under a ban on short selling, more and more of the weights will become 
zero. The same "freezing out" of the weights has been observed in portfolio optimization 
[68] as an empirical fact. 
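The contrast between the two regularizers in the orthonormal design case can be written in closed form for ordinary least-squares regression. This is a sketch of the standard regression result, assuming the objectives (1/2)‖y - Xβ‖² + (λ/2)‖β‖² for the L2 penalty and (1/2)‖y - Xβ‖² + λ‖β‖₁ for the L1 penalty with orthonormal columns of X. It does not include a budget constraint and is not the portfolio problem itself; the coefficient values are made up for illustration.

```python
import numpy as np

def ridge_orthonormal(beta_ols, lam):
    """L2 penalty: every coefficient is shrunk by the same factor."""
    return beta_ols / (1.0 + lam)

def lasso_orthonormal(beta_ols, lam):
    """L1 penalty: soft-thresholding sets small coefficients exactly to zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([3.0, -0.4, 0.2, -2.0])    # unregularized least-squares solution
beta_l2 = ridge_orthonormal(beta_ols, lam=0.5)
beta_l1 = lasso_orthonormal(beta_ols, lam=0.5)
print(beta_l2, beta_l1)
```

Which coefficients fall below the threshold depends on the estimates themselves, so with noisy estimates the set of vanishing coefficients changes from sample to sample.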

It is important to stress that the vanishing of some of the weights does not reflect 
any structural property of the objective function, it is just a random effect: as clearly 
demonstrated by simulations [67], for a different sample a different set of weights 
vanishes. The angle of the weight vector fluctuates wildly from sample to sample. 
(The behavior of the solutions is similar for other limit systems as well.) This means 
that the solutions will be determined by the limit system and the random sample, 
rather than by the structure of the market. So the underlying instability is merely 
"masked", in that the solutions do not run away to infinity, but they are still unstable 
under sample fluctuations when T is too small. As it is certainly not in the interest 
of the investor to obtain a portfolio solution which sets weights to zero on the basis 
of unreliable information from small samples, the above observations speak strongly in 
favor of using the L2 norm over the L1 norm. 

7. Conclusion 

We have made the observation that the optimization of large portfolios minimizes the 
empirical risk in a regime where the data set size is similar to the size of the portfolio. 
In that regime, a small empirical risk does not necessarily guarantee a small actual risk 
[24]. In this sense naive portfolio optimization over-fits the data. Regularization can 
overcome this problem by reducing the capacity of the considered model class. 

Regularized portfolio optimization has choices to make, not only about the risk 
function, but also about the regularizer. Here, we have focussed on the increasingly 
popular expected shortfall risk measure. Using the L2 norm as a regularizer leads 
to a convex optimization problem which can be solved with quadratic programming. We 
have shown that regularized portfolio optimization is then a variant of support vector 
regression. The differences are an asymmetry, due to the tolerance to large positive 
deviations, and the budget constraint, which is not present in regression. 

‖ This point has been made independently by [17]. 

Our treatment provides a novel insight into why diversification is so important. The 
L2 regularizer implements a pressure towards portfolio diversification. Therefore, from 
a statistical point of view, diversification is important as it is one way to control the 
capacity of the portfolio optimizer and thereby to find a solution which is more stable, 
and hence meaningful. 

In summary, the method we have outlined in this paper allows for the unified 
treatment of optimization and diversification in one principled formalism. It shows how 
known methods from modern statistics can be used to improve the practice of portfolio 
optimization. 